International External Validation of Risk Prediction Model of 90-Day Mortality after Gastrectomy for Cancer Using Machine Learning

Simple Summary
A 90-day mortality predictive model for curative gastric cancer resection based on the Spanish EURECCA Esophagogastric Cancer database was externally validated using the GASTRODATA registry. The externally validated model performed modestly worse than the original model but maintained its discriminative ability in clinical practice.

Abstract
Background: Radical gastrectomy remains the main treatment for gastric cancer, despite its high mortality. A clinical predictive model of 90-day mortality (90DM) risk after gastric cancer surgery based on the Spanish EURECCA registry database was developed using a machine learning algorithm. We performed an external validation of this model based on data from an international multicenter cohort of patients. Methods: A cohort of patients from the European GASTRODATA database was selected. Demographic, clinical, and treatment variables in the original and validation cohorts were compared. The performance of the model was evaluated using the area under the curve (AUC) for a random forest model. Results: The validation cohort included 2546 patients from 24 European hospitals. The advanced clinical T- and N-category, neoadjuvant therapy, open procedures, total gastrectomy rates, and mean volume of the centers were significantly higher in the validation cohort. The 90DM rate was lower in the validation cohort (3.7%) than in the original cohort (5.6%). The AUC in the validation model was 0.716. Conclusion: The externally validated model for predicting the 90DM risk in gastric cancer patients undergoing gastrectomy with curative intent remains useful in clinical practice, albeit with modestly lower discrimination than the original model.


Introduction
Despite a significant decline in its incidence in recent years, gastric cancer remains the fourth leading cause of cancer death worldwide [1]. Surgical intervention continues to be the primary potentially curative option for patients with gastric cancer, even in the setting of multimodal treatment [2]. This intervention in benchmark patients is associated with an overall morbidity rate of 16.2% and with 30- and 90-day mortality rates of 0.3% and 0.5%, respectively [3]. In other series, however, morbidity has reached 20-45% [4-7] and mortality 2-7% [4,6,7].
An accurate preoperative risk assessment for these procedures is important to help with the selection of patients. However, in gastric cancer surgery, few risk prediction models have been developed [8]. Most models focus on predicting survival following a curative resection, whereas only a few studies have been conducted to predict operative mortality [9-13]. Moreover, the majority of these studies are based on classical logistic regression or Cox regression analysis, even though artificial intelligence (AI)-related tools are now available and increasingly used to assist clinicians in making tailor-made treatment decisions [14].
Additionally, despite the growing number of predictive models (classical or developed with AI), their quality and clinical impact are often insufficient, partly because of the lack of an external validation that would support their validity and clinical applicability [14]. The external validation of a risk prediction algorithm is an important step in the process of building and evaluating a model, since it provides information about the reproducibility and generalizability of the model and supports its clinical applicability [14]. In gastric cancer surgery, only 13% of the predictive models developed have undergone a high-quality validation [8]. External validation is rarely performed because of its practical difficulty (the need for multi-institutional collaboration across different geographic regions to obtain external cohorts in different settings) [15] and because discriminative ability is usually reduced in validation studies, which makes them unattractive for publication [8].
A clinical model for predicting the risk of 90-day mortality (90DM) after gastrectomy using AI was recently developed. The model showed an excellent performance (AUC 0.829) in the original cohort [16], but external validation of the risk prediction algorithm is necessary to provide information on its reproducibility and generalizability (or transportability), as well as to define its clinical applicability [14,17]. To our knowledge, external validation studies of machine learning (ML) models in the setting of gastric cancer surgery have not been previously reported [18].
The objective of the study was to perform an external validation of a 90DM risk prediction model using ML in gastric cancer patients undergoing gastrectomy with curative intent using a cohort from the European GASTRODATA database.

Materials and Methods
This study conformed to the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) reporting guidelines (Appendix S1) [19].

Study Development Cohort
The cohort from which the risk prediction model was derived has been previously described [16]. Briefly, data were retrieved from the Spanish EURECCA Esophagogastric Cancer Registry (SEEGCR), which includes data from 39 public hospitals of the National Health Care System in six regions of Spain, covering a population of nearly 14 million inhabitants. The SEEGCR database was audited for the 2014-2017 period, with a completeness of 97% and data accuracy of 95% [20]. The SEEGCR is linked to the EURECCA Upper Gastrointestinal network, a multi-institutional population-based cohort registry that collects prospective clinical data from all patients with primary esophageal, gastro-esophageal junction (GEJ), and gastric cancer undergoing resection with curative intent.

Validation Cohort
For the present multi-institutional validation study, data were collected from the European GASTRODATA database. The registry collects retrospective and prospective clinical data from patients with primary gastric cancer, including cancer of the GEJ, who underwent surgical resection with curative intent between 2015 and 2022 in 25 hospitals from 11 European countries. As in the SEEGCR database, patients' information was collected using an online platform (www.gastrodata.org, accessed on 5 September 2022) in which the following five sections had to be completed: (1) clinical features, (2) oncological characteristics and surgical data, (3) perioperative complications, (4) outcome at hospital discharge, and (5) outcome at 30 and 90 days postoperatively [5].
Most variables used in the development of the model were also available in the GASTRODATA registry. Moreover, both registries used the same definition criteria for these variables, especially for those related to complications and outcome measures [21].

Ethics
The local ethics committees of the centers participating in each of the registries (SEEGCR and GASTRODATA) approved the collection of anonymized data. The scientific committee of the GASTRODATA group approved sharing the dataset for the external validation project.

Eligibility and Primary Outcome
All patients with primary gastric or GEJ cancer (excluding Siewert 1 tumors) who underwent gastrectomy (partial or total) with curative intent and were included in the GASTRODATA registry from 2015 to 2022 were eligible. The primary outcome was 90DM, defined as all-cause mortality within 90 days after surgery.

Predictor Characteristics and Statistical Analysis
The preoperative variables of the SEEGCR database used for the development of the original ML-based algorithm were also obtained from the GASTRODATA registry, and the two sets were compared with each other. The principal investigators of the GASTRODATA centers were asked to retrieve missing variables or variables not available in the registry, such as preoperative hemoglobin level and center volume. Age, body mass index (BMI), hemoglobin and albumin serum levels, and hospital volume activity (number of gastrectomies per center per year) were considered as continuous variables. The remaining variables (gender, BMI index, weight loss, ASA score, ECOG score, tumor location, clinical stage, neoadjuvant therapy, minimally invasive or open approach, subtotal or total gastrectomy, elective or urgent surgery, and comorbidities such as renal disease, pulmonary disease, peripheral vascular disease, myocardial infarction, diabetes mellitus, cerebrovascular disease, congestive heart failure, peptic ulcer disease, malignant lymphoma, dementia, liver disease, connective tissue disease, leukemia, hemiplegia, AIDS, malignant tumor, and metastatic tumor) were converted into dichotomous variables using one-hot encoding [22]. Missing data were handled by adding a separate "missing" category to predictor variables that had missing values [23]. Descriptive statistics are presented as means and standard deviations or numbers and percentages for continuous and categorical variables, respectively. Differences between the groups of patients who survived and those who died within 90 postoperative days were evaluated using Fisher's exact test for categorical variables or the Kolmogorov-Smirnov test for continuous variables. Statistical significance was set at p < 0.05.
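The encoding of categorical predictors with a separate category for missing values can be illustrated with a minimal Python sketch. This is not the code used in the study (the analysis was performed in R); the function name `one_hot_with_missing`, the column naming scheme, and the ECOG values are hypothetical and serve only to show the idea.

```python
from typing import Dict, List, Optional


def one_hot_with_missing(
    values: List[Optional[str]], levels: List[str], name: str
) -> List[Dict[str, int]]:
    """One-hot encode a categorical variable with an explicit 'missing' level.

    Any value that is None or not among the known levels is assigned to the
    'missing' column, so no patient is dropped and no value is guessed.
    """
    columns = [f"{name}={level}" for level in levels] + [f"{name}=missing"]
    rows = []
    for value in values:
        row = dict.fromkeys(columns, 0)
        key = f"{name}={value}" if value in levels else f"{name}=missing"
        row[key] = 1
        rows.append(row)
    return rows


# Hypothetical ECOG scores for five patients; None marks an unrecorded score.
ecog = ["0", "1", None, "2", "1"]
encoded = one_hot_with_missing(ecog, levels=["0", "1", "2"], name="ecog")
for row in encoded:
    print(row)
```

Each patient ends up with exactly one indicator set to 1, so a model can learn whether the absence of a value is itself associated with the outcome.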

External Validation of the Predictive Model
Trained models developed in the previous study (Random Forest, cv-Enet, glmboost, and ensemble) [16] were applied to the external validation set. Briefly, cv-Enet (cross-validated elastic net regularized logistic regression) [24] is an algorithm that determines the optimal coefficients for the lasso and ridge penalties through internal cross-validation, whereas RF (Random Forest) and glmboost are composed of decision trees and a generalized linear model fitted with a boosting algorithm, respectively [25-27]. Finally, the ensemble model combines the three previous models through a linear blend of their predicted probabilities using logistic regression. The discrimination of the models on the external validation dataset was assessed using the area under the curve (AUC). Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the precision-recall curve (AUPRC) were also reported for each model. To assess the feature attribution of each variable in model testing, the predict_parts function from the DALEX package was used [28]. For each sample, the absolute feature attributions were calculated and then averaged over the whole cohort. Data analysis was performed using R software version 4.2.0 (R Foundation for Statistical Computing, Vienna, Austria). The models were validated using the mlr3 package [29].
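As an illustration of the discrimination metrics listed above, the following Python sketch computes the AUC and the threshold-based metrics from predicted probabilities. It is a simplified example, not the mlr3 code used in the study; the labels, scores, and the 0.2 decision threshold are invented for demonstration.

```python
from typing import Dict, List


def auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ratio(numerator: int, denominator: int) -> float:
    """Safe division: returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def threshold_metrics(
    y_true: List[int], scores: List[float], threshold: float
) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV at a given probability threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Invented example: 1 = died within 90 days, 0 = survived.
y = [0, 0, 0, 0, 0, 0, 1, 1]
predicted_risk = [0.02, 0.05, 0.10, 0.30, 0.04, 0.08, 0.25, 0.40]

auc_value = auc(y, predicted_risk)
metrics = threshold_metrics(y, predicted_risk, threshold=0.2)
print(round(auc_value, 3), metrics)
```

The AUC does not depend on a threshold, whereas sensitivity, specificity, PPV, and NPV all change with the chosen cut-off, which is why both kinds of metric are usually reported together.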

Results
A total of 2595 patients from 25 hospitals in 11 European countries were included in the GASTRODATA database over an 8-year period (2015-2022), with 90-day follow-up available for all patients. Patients from the Hospital del Mar registered in the GASTRODATA registry were excluded because they were part of the development cohort. Finally, 2546 patients from 24 hospitals in 11 European countries were included in the analysis (Supplementary Table S1). The overall rate of missing data for variables was 4% (3215 items in 86,564 cells). The most frequently missing characteristics were preoperative albumin (n = 668 [26%]) and Eastern Cooperative Oncology Group (ECOG) performance status (n = 554 [21%]).
Table 1 shows data on the preoperative variables for the development and external validation cohorts. The 90DM rate was lower in the GASTRODATA cohort than in the SEEGCR cohort: 3.7% (95 patients) versus 5.6% (179 patients). Age, BMI, and the rates of congestive heart failure, chronic obstructive pulmonary disease (COPD), cerebrovascular disease, complicated diabetes mellitus, leukemia, malignant lymphoma, and liver disease were significantly lower in the GASTRODATA cohort. Furthermore, the GASTRODATA patients more frequently had a lower ECOG performance status and American Society of Anesthesiologists (ASA) score, with higher percentages of weight loss and more advanced clinical T and N stages. Regarding tumor location, there were more cases of linitis plastica and GEJ tumors. Additionally, elective and open procedures, as well as neoadjuvant treatment and total gastrectomy, were more common in the external validation cohort. The mean volume of the centers was higher in the external validation cohort.

Model Performance: Discrimination
Table 2 summarizes all the performance metrics obtained with the random forest model, which was the model with the best performance in both the development and the external validation cohorts (Figure 1). The AUCs for the development and external validation cohorts were 0.844 and 0.716, respectively, an absolute reduction of 0.128 (12.8 percentage points). The performance metrics obtained with the other models (cv-Enet, glmboost, and ensemble) are shown in Supplementary Table S2.

Variable Importance
A feature attribution analysis was performed on the external validation dataset by decomposing the model predictions into contributions that could be assigned to specific variables. The most important factors for the prediction were age, ASA score, center volume, preoperative serum albumin level, ECOG, preoperative serum hemoglobin level, and neoadjuvant treatment (Figure 2).
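The aggregation step behind this ranking (taking the absolute per-patient attributions and averaging them over the cohort) can be sketched in Python as follows. The per-patient attribution values are invented, and the normalization to a sum of 1 is an assumption made for illustration; the source does not state how the values in Figure 2 were normalized.

```python
from typing import Dict, List


def mean_absolute_attribution(per_patient: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the absolute attribution of each feature over all patients,
    then normalize so that the importances sum to 1 (assumed normalization)."""
    features = list(per_patient[0].keys())
    n = len(per_patient)
    mean_abs = {f: sum(abs(row[f]) for row in per_patient) / n for f in features}
    total = sum(mean_abs.values())
    return {f: value / total for f, value in mean_abs.items()}


# Invented attributions for two patients: each number is the contribution of
# a variable to that patient's predicted risk (positive or negative).
attributions = [
    {"age": 0.04, "asa": -0.02, "albumin": 0.01},
    {"age": -0.02, "asa": 0.02, "albumin": -0.01},
]

importance = mean_absolute_attribution(attributions)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking, importance)
```

Taking absolute values before averaging prevents positive and negative contributions in different patients from cancelling out, so a variable that strongly moves predictions in either direction is still ranked as important.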

Discussion
We conducted an external validation of the ML-based SEEGCR risk prediction model of 90DM in patients undergoing gastric cancer resection with curative intent using the GASTRODATA registry, a large multicenter European database. To our knowledge, this is the first external validation study of an ML-based model for the prediction of mortality in the field of gastric cancer surgery. The AUC for the external validation cohort was 0.716, which is lower than those achieved previously in the development (0.844) and internal-external validation (0.829) cohorts. However, this drop in performance does not necessarily invalidate the usefulness of an additional tool for assessing the prognosis of surgical patients with gastric cancer.
The external validation of a risk prediction algorithm is important to assess the clinical applicability of the model in similar (reproducibility) or different populations (generalizability or transportability) [14,17]. Despite the growing interest in developing predictive models in clinical practice, a recent review of the state of the art of AI-enabled decision support in surgery found that, among 36 studies, external validation was performed in only 5 (13.8%) [18]. In the field of esophagogastric cancer surgery, the discriminative ability of models was significantly lower in the validation than in the development phase [8]. In an evaluation of the external validation processes of 31 prediction models for different conditions (cardiovascular diseases, gastrointestinal-related diseases, malignancies, and others) [30], the AUC decreased on average by 0.062. A decrease in discrimination was likewise observed in our study, in which the AUC nevertheless remained above 0.7 (0.716). The limited number of external validation studies may be explained by two reasons: the difficulty of obtaining external cohorts with a sufficiently large sample size, and the fact that the discriminative ability of a model in validation is usually inferior to that found in development.
A collaboration between the SEEGCR and the GASTRODATA registry allowed us to use the GASTRODATA dataset of 2546 cases for the external validation, which conforms to the recommendation of having a cohort of at least 1000 patients for validation [18]. However, the two registries differ. First, the SEEGCR is a population-based registry that includes all consecutive patients operated on in all centers from six Spanish regions, representing real-world practice, whereas the GASTRODATA registry includes a selection of patients operated on in 25 medium- and high-volume hospitals from 11 European countries. Second, an overall assessment of the relatedness between the development and the external validation samples revealed case mix differences in predictor variables, as well as a different occurrence of the outcome (90DM). While patients in the SEEGCR appeared to be in poorer physical condition (older, worse ECOG and ASA scores, and more comorbidities), patients in the GASTRODATA cohort had more advanced clinical T and N stages, more frequently received neoadjuvant treatment, and had more elective and open procedures, with total gastrectomy as the most common procedure. Additionally, the mean volume of the participating hospitals was significantly higher in the external validation cohort. The mortality rate in the GASTRODATA registry was lower than that in the SEEGCR registry. This may be explained by the higher volume of the hospitals contributing to GASTRODATA as compared with the heterogeneity in volume and technological level of the hospitals participating in the SEEGCR [20].
It is important to note that the AUC alone may not provide a complete picture of the predictive performance of a model, as it does not take into account factors such as model calibration or the prevalence of the outcome being predicted. Therefore, it is typically recommended to consider other performance metrics in addition to the AUC, such as sensitivity, specificity, predictive values, and calibration measures [18]. Another important performance metric is the area under the precision-recall curve (AUPRC), which is based on the PPV (precision) and sensitivity (recall) and evaluates how well a model can identify positive examples in a dataset. The value of the AUPRC lies in the fact that it remains informative for imbalanced datasets, particularly those in which relatively rare events are predicted [31,32]. Based on these metrics, the RF model is the best model to identify patients at risk of 90DM, as it showed the highest PPV together with the highest sensitivity and the highest AUPRC in the GASTRODATA cohort.
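To make the role of the AUPRC with rare outcomes more concrete, the sketch below computes average precision, one common step-wise estimate of the area under the precision-recall curve, and compares it with the outcome prevalence, which is the precision expected from a model that ranks patients at random. The labels and scores are invented, the example assumes no tied scores, and this is not necessarily the estimator used in the study; only the 95 deaths among 2546 patients are taken from the text.

```python
from typing import List


def average_precision(y_true: List[int], scores: List[float]) -> float:
    """Average of the precision values obtained at the rank of each positive
    case, after sorting patients from highest to lowest predicted risk."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` patients
    return precision_sum / hits if hits else float("nan")


# Invented example: 1 = died within 90 days, 0 = survived.
y = [0, 0, 0, 0, 0, 0, 1, 1]
predicted_risk = [0.02, 0.05, 0.10, 0.30, 0.04, 0.08, 0.25, 0.40]

ap = average_precision(y, predicted_risk)
baseline = sum(y) / len(y)  # prevalence in the toy data

# Prevalence of 90DM in the external validation cohort, as reported in the text.
cohort_prevalence = 95 / 2546

print(round(ap, 3), round(baseline, 3), round(cohort_prevalence, 3))
```

Because the reference level for the AUPRC is the prevalence rather than 0.5, an AUPRC that looks low in absolute terms can still represent a substantial gain over chance when the event is rare, as is the case for 90DM.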
The current study provides insights into the additional value of particular input variables for predicting the risk of 90DM. The differences between the values of the variables detected in the development and validation cohorts were minimal, and four of the most important factors (age, volume, and preoperative serum levels of hemoglobin and albumin) were shared by the two cohorts. These four variables were also clinically relevant and easy to obtain at the bedside.
Several potential limitations of the study should be noted. First, the GASTRODATA registry includes a selection of patients undergoing gastrectomy at the different participating hospitals, and not all patients were consecutively recruited (it has been estimated that 396 cases are missing based on the mean real volumes reported by each hospital). Second, there was a difference in the quality of the datasets. The GASTRODATA registry has not been audited, in contrast to the SEEGCR registry, which was audited (period 2014-2017) with 97% completeness and 95% data accuracy [20]. A third limitation is the overall rate of missing data of 4% in the GASTRODATA dataset (3215 items in 86,564 cells), compared with 0.6% (677 items in 101,824 cells) in the SEEGCR. This higher rate of missing data could be partly explained by differences in the classification of variables. For example, in GASTRODATA, the variables "leukemia" and "malignant lymphoma" were collected as the same variable, and the option "cNx" in "Tumor cN stage" was not considered in the SEEGCR. In both cases, data were recorded as missing. Additionally, it should be noted that 11.8% of the validation cohort were classified as ASA I. It is probable that ASA scores were underestimated, because patients with cancer may fit the ASA II score as they already have a systemic disease. A fourth limitation is the small number of events in the external validation cohort: 95 deaths at 90 days (compared with 179 in the SEEGCR), which is just below the minimum required number of events (100) and well below the optimal number (>250) [14].

Conclusions
In conclusion, the ML-based algorithm of the SEEGCR registry for predicting the risk of 90DM in patients undergoing gastric cancer surgery with curative intent performed modestly worse in a European multi-institutional external validation study. However, the predictive model continues to be useful to assess the post-surgical clinical outcome in this population. The external validation of the 90DM predictive model adds value to the original instrument.

Figure 1. Model discrimination in both the development (a) and the external validation (b) cohorts. The AUC for the random forest (RF) model in the development cohort was 0.844 (95% confidence interval [CI] 0.84-0.85), as compared with an AUC of 0.716 (95% CI 0.66-0.77) in the external validation cohort.

Figure 2. Feature attribution of the RF model. Normalized mean of absolute feature attributions of all factors of the GASTRODATA cohort on the random forest (RF) model.

Table 1. Potential risk factors for 90-day mortality in the development and external validation cohorts.
$ At the time of diagnosis; & According to the seventh edition of the AJCC; AIDS indicates acquired immune deficiency syndrome; ASA, American Society of Anesthesiologists; ECOG, Eastern Cooperative Oncology Group; SD, standard deviation.

Table 2. Performance metrics from the development and external validation cohorts for the Random Forest (RF) model.
Abbreviations: AUC, Area under the curve; PPV, positive predictive value; NPV, negative predictive value; AUPRC, area under the precision recall curve; and CI: confidence interval.